Green-Thread Actor Runtime

Erlang's isolation model. Rust's zero-copy ownership. No function colouring.

smarm is a prototype concurrent runtime for Rust. Each actor is a green thread with its own mmap'd stack. N OS threads share a single global run queue. Actors communicate exclusively via message passing (owned values over channels); no shared mutable state without an explicit Arc<Mutex<T>>.

Preemption is allocator-driven: every Nth heap allocation, smarm reads RDTSC and yields the actor if its timeslice has expired. No OS signals, no separate timer thread for scheduling.

vs async/await

No function colouring. No Box<dyn Future>. No poll state machines. Just plain Rust functions that block.

vs OS threads

64 KB stacks instead of 8 MB. Context switch in ~10–20 ns (6 GPR saves + ret) in user space, with no trip through the kernel.

vs Erlang BEAM

Zero-copy ownership via Rust's type system. No GC pause. No copying GC. Message passing is a move, not a clone.

Module Map

13 source modules, three rough layers. The bottom layer has zero smarm dependencies; middle layer builds the runtime machinery; top layer is public API.

[Diagram: module map in three layers]

Layer 0 (primitives): stack (mmap + guard), context (naked asm context switch), preempt (alloc hook + RDTSC), pid ((index, gen) pair), timer (min-heap), supervisor (Signal enum only), trace (Chrome JSON, optional)
Layer 1 (runtime machinery): actor (trampoline + thread-locals), io (epoll + pool thread), channel (MPSC, park/unpark), mutex (timeout + FIFO), runtime (SharedState + loop)
Layer 2 (public API / facade): scheduler (public API facade), lib.rs (re-exports + GlobalAlloc)
Edge legend: uses directly; uses via type (Pid etc.); public API edge

stack
Layer 0 · primitive

Calls mmap for a contiguous region, then mprotect's the bottom page to PROT_NONE. Stack grows downward; overflow hits the guard page → SIGSEGV. Implements Drop via munmap. Zero smarm dependencies.

context
Layer 0 · primitive

Two #[naked] assembly functions (switch_to_actor, switch_to_scheduler). Save 6 callee-saved GPRs, swap rsp, restore, ret. Thread-locals hold each side's saved stack pointer. XMM registers not saved here — compiler guarantees spill at Rust call sites.

preempt
Layer 0 · primitive

Implements GlobalAlloc — wraps System allocator. On every Nth alloc, reads RDTSC. If elapsed > timeslice_cycles and preemption is enabled, calls switch_to_scheduler(). Thread-locals hold the countdown, start timestamp, and an enabled flag (scheduler disables it to prevent self-preemption).

pid
Layer 0 · primitive

struct Pid(u32 index, u32 generation). Index = slot in the actor table. Generation increments on actor death. Stale handles are detectable: a Pid with wrong generation fails slot lookup rather than silently addressing a new actor. Solves ABA without exhausting PID space.
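The generation check can be illustrated with a minimal sketch. The `SlotTable` type and its method names are hypothetical and not taken from smarm's source; only the (index, generation) idea comes from the description above.

```rust
/// Generational handle, following the (index, generation) description above.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct Pid {
    pub index: u32,
    pub generation: u32,
}

struct Slot<T> {
    generation: u32,
    value: Option<T>,
}

/// Hypothetical slot table: a Vec of slots plus a free list of reusable indices.
pub struct SlotTable<T> {
    slots: Vec<Slot<T>>,
    free: Vec<u32>,
}

impl<T> SlotTable<T> {
    pub fn new() -> Self {
        SlotTable { slots: Vec::new(), free: Vec::new() }
    }

    /// Reuse a freed slot if one exists, otherwise grow the table.
    pub fn insert(&mut self, value: T) -> Pid {
        if let Some(index) = self.free.pop() {
            let slot = &mut self.slots[index as usize];
            slot.value = Some(value);
            Pid { index, generation: slot.generation }
        } else {
            let index = self.slots.len() as u32;
            self.slots.push(Slot { generation: 0, value: Some(value) });
            Pid { index, generation: 0 }
        }
    }

    /// Lookup fails for a stale Pid instead of addressing the slot's new occupant.
    pub fn get(&self, pid: Pid) -> Option<&T> {
        let slot = self.slots.get(pid.index as usize)?;
        if slot.generation != pid.generation {
            return None;
        }
        slot.value.as_ref()
    }

    /// "Actor death": empty the slot, bump the generation, recycle the index.
    pub fn remove(&mut self, pid: Pid) -> Option<T> {
        let slot = self.slots.get_mut(pid.index as usize)?;
        if slot.generation != pid.generation {
            return None;
        }
        let value = slot.value.take()?;
        slot.generation = slot.generation.wrapping_add(1);
        self.free.push(pid.index);
        Some(value)
    }
}
```

After a remove followed by an insert, the new Pid shares the old index but carries a different generation, so the old handle no longer resolves.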

actor
Layer 1 · machinery

Owns the Stack. Defines the trampoline: every actor's first ret lands here. Trampoline reads the closure from a thread-local, calls it inside catch_unwind, writes the Outcome to another thread-local, then yields back to the scheduler. Thread-locals: current PID, pending closure, last outcome, done flag.

runtime
Layer 1 · core

The heaviest module. Contains SharedState (slot table, run queue, timers, IO), RuntimeInner (shared state behind a mutex, per-thread stats, drain lock), and schedule_loop — the main scheduler loop that drains timers, drains IO completions, pops actors, resumes them, and handles the post-yield intent (re-queue vs park vs finalize).

channel
Layer 1 · primitive

Unbounded MPSC. Inner state is Arc<Mutex<Inner<T>>> — senders are clonable, last drop closes channel. recv(): checks queue; if empty, registers self as parked_receiver, releases the lock, calls park_current(). send(): pushes, takes the parked PID, calls unpark(pid).

mutex
Layer 1 · primitive

Actor-aware mutex with mandatory timeout (default 30s). Fast path: no holder → grant immediately. Slow path: join FIFO waiter queue, insert a WaitTimeout timer, park. On timer expiry: if actor is still in waiters, unpark it with LockTimeout. On guard drop: pop next waiter, grant, unpark.
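The fast path, FIFO slow path, and timeout handling can be modelled as a small state machine. This is a sketch under assumptions: `MutexState`, `LockResult`, and `Wake` are hypothetical names, a plain `u32` stands in for Pid, and parking and timers are left to the caller.

```rust
use std::collections::VecDeque;

type Pid = u32; // stand-in for smarm's Pid

#[derive(Debug, PartialEq, Eq)]
pub enum LockResult {
    Granted, // fast path: no holder
    Queued,  // slow path: caller would insert a WaitTimeout timer and park
}

#[derive(Debug, PartialEq, Eq)]
pub enum Wake {
    Granted(Pid),     // unpark this actor as the new holder
    LockTimeout(Pid), // unpark this actor with a timeout error
}

#[derive(Default)]
pub struct MutexState {
    holder: Option<Pid>,
    waiters: VecDeque<Pid>,
}

impl MutexState {
    pub fn lock(&mut self, pid: Pid) -> LockResult {
        if self.holder.is_none() {
            self.holder = Some(pid);
            LockResult::Granted
        } else {
            self.waiters.push_back(pid);
            LockResult::Queued
        }
    }

    /// Guard drop: hand the lock to the oldest waiter, if any.
    pub fn unlock(&mut self) -> Option<Wake> {
        self.holder = self.waiters.pop_front();
        self.holder.map(Wake::Granted)
    }

    /// Timer expiry: only acts if the actor is still waiting.
    /// A stale timer (lock already granted) is a no-op and returns None.
    pub fn on_timeout(&mut self, pid: Pid) -> Option<Wake> {
        let pos = self.waiters.iter().position(|&p| p == pid)?;
        self.waiters.remove(pos);
        Some(Wake::LockTimeout(pid))
    }
}
```

The `on_timeout` path returning `None` for an actor that is no longer queued is what lets stale timer entries be ignored rather than cancelled.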

io
Layer 1 · machinery

Two background OS threads: an epoll thread (waits on fds with EPOLLONESHOT; on ready, pushes FdReady completion) and a pool thread (runs blocking closures inside catch_unwind; pushes Blocking completion). Both write a wake pipe byte to stir the scheduler. Completions are drained inside schedule_loop.

timer
Layer 0 · primitive

BinaryHeap<Reverse<Entry>> = min-heap by deadline. Two Reason variants: Sleep (unpark unconditionally) and WaitTimeout (call target.on_timeout()). No cancellation — stale entries are no-ops on pop. Entries inserted by sleep() and mutex::lock_timeout().
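A minimal sketch of the min-heap arrangement, assuming deadlines are plain `u64` ticks and using a hypothetical `Timers` wrapper with a `drain_expired` method; the real `Entry` fields are not shown in this document.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum Reason {
    Sleep,
    WaitTimeout,
}

// Field order matters: derive(Ord) compares `deadline` first.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub struct Entry {
    pub deadline: u64, // assumed unit: abstract ticks
    pub pid: u32,      // stand-in for Pid
    pub reason: Reason,
}

#[derive(Default)]
pub struct Timers {
    // BinaryHeap is a max-heap; Reverse flips the order so the earliest
    // deadline sits at the top.
    heap: BinaryHeap<Reverse<Entry>>,
}

impl Timers {
    pub fn insert(&mut self, entry: Entry) {
        self.heap.push(Reverse(entry));
    }

    /// Pop every entry whose deadline has passed, earliest first.
    pub fn drain_expired(&mut self, now: u64) -> Vec<Entry> {
        let mut due = Vec::new();
        while let Some(Reverse(entry)) = self.heap.peek().copied() {
            if entry.deadline > now {
                break;
            }
            self.heap.pop();
            due.push(entry);
        }
        due
    }
}
```

With no cancellation, the caller decides per popped entry whether it is still relevant; a stale entry is simply dropped.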

scheduler
Layer 2 · public facade

Thin facade. Exposes spawn, yield_now, park_current, unpark, sleep, block_on_io, wait_readable, wait_writable, run. All delegate to runtime. Also owns JoinHandle and the NoPreempt RAII guard.

supervisor
Layer 0 · primitive

Just the Signal enum: Exit(Pid) or Panic(Pid, Box<dyn Any+Send>). No restart logic — that's user-space policy. Signals are delivered via the supervisor actor's own channel (Sender<Signal> stored in the child's slot).
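Since restart logic is left to user code, a supervisor is essentially a loop over incoming signals. The sketch below is illustrative only: it uses `std::sync::mpsc` as a stand-in for smarm's channel, a tuple as a stand-in for Pid, and the `Decision` type and `supervise` function are hypothetical.

```rust
use std::any::Any;
use std::sync::mpsc::Receiver;

pub type Pid = (u32, u32); // stand-in: (index, generation)

pub enum Signal {
    Exit(Pid),
    Panic(Pid, Box<dyn Any + Send>),
}

/// Hypothetical record of what a user-space policy decided per signal.
#[derive(Debug, PartialEq, Eq)]
pub enum Decision {
    Ignore(Pid),
    Restart(Pid, String),
}

/// Panic payloads are usually a &'static str or a String.
pub fn panic_message(payload: &(dyn Any + Send)) -> String {
    if let Some(s) = payload.downcast_ref::<&'static str>() {
        s.to_string()
    } else if let Some(s) = payload.downcast_ref::<String>() {
        s.clone()
    } else {
        "<non-string panic payload>".to_string()
    }
}

/// Example policy: ignore clean exits, ask for a restart after a panic.
/// The loop ends once every sender has been dropped.
pub fn supervise(signals: Receiver<Signal>) -> Vec<Decision> {
    let mut log = Vec::new();
    for signal in signals {
        match signal {
            Signal::Exit(pid) => log.push(Decision::Ignore(pid)),
            Signal::Panic(pid, payload) => {
                log.push(Decision::Restart(pid, panic_message(&*payload)))
            }
        }
    }
    log
}
```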

Who Imports What

The critical insight: runtime.rs is the hub. Every substantive module either feeds into it or is orchestrated by it. scheduler.rs is purely a facade — it imports runtime and re-exports it through the public API.

[Diagram: runtime.rs (SharedState, schedule_loop) sits at the centre. stack (Stack::new()), context (switch fns), preempt (RDTSC + hook), actor (trampoline), timer (min-heap), io (epoll + pool), and supervisor (Signal enum) connect to it. channel and mutex call unpark() directly. scheduler.rs / lib.rs form the public API (re-exports, GlobalAlloc); runtime calls unpark() via scheduler.]

Circular dependency: channel and mutex call scheduler::unpark(), which calls into runtime. And runtime's schedule_loop resumes actors that run channel/mutex code. This is intentional — it's the cooperative unpark mechanism. It works because unpark() never blocks and preemption is disabled while holding any smarm internal lock.

What Happens When You Call run(f)

This walk-through starts from user code calling smarm::run(|| { ... }). The single-threaded run() is a wrapper around runtime::init(Config::exact(1)).run(f).

1

Install panic hook (once)

A OnceLock guard installs a custom panic hook that suppresses output inside actor context. Without this, concurrent actor panics can deadlock Rust's default backtrace printer (non-reentrant internal lock). The previous hook is chained for panics outside actors.

2

Start IoThread (io.rs)

Creates a non-blocking wake pipe (O_NONBLOCK). Creates an epollfd. Creates a shutdown pipe and registers it in the epollfd. Spawns the epoll thread (epoll_wait loop) and the pool thread (blocking-work mpsc receiver). Both share a completion VecDeque behind a mutex.

3

Install RUNTIME thread-local (runtime.rs)

Arc<RuntimeInner> is cloned into the calling thread's RUNTIME thread-local. This makes with_runtime() work on the calling thread immediately — needed for the next step.

4

Spawn initial actor (scheduler.rs)

Calls scheduler::spawn(f). This locks SharedState, allocates a slot, creates a Stack via mmap, calls init_actor_stack() to write the initial register frame (trampoline address + 6 zero GPR slots), stores the closure in pending_closures, pushes the PID to the run queue, returns a JoinHandle.

5

Spawn N-1 OS scheduler threads

For each extra thread: clone Arc<RuntimeInner>, spawn OS thread, set RUNTIME and SCHED_SLOT thread-locals, enter schedule_loop. Thread 0 is the calling thread.

6

Enter schedule_loop on thread 0 (runtime.rs)

This is a loop { drain → pop → resume → handle-intent }. Thread 0 blocks here until the run queue is empty and no timers or IO are pending. All actors run inside this loop. This call does not return until the program is done.

7

Shutdown sequence

All scheduler threads return from schedule_loop. OS threads are joined. IoThread::drop() is called: it writes to the shutdown pipe → epoll thread exits; drops the mpsc sender → pool thread exits; closes all fds. SharedState is cleared for a potential next run() call.

The Yield → Schedule → Resume Cycle

This is the heartbeat of the entire runtime. Every context switch follows exactly this path, whether triggered by a cooperative yield, preemption, channel recv, mutex contention, or IO wait.

[Diagram: the cycle across three lanes (actor stack, scheduler OS thread, shared state). In order:]

  • Actor code running, PREEMPTION_ENABLED = true
  • Yield triggered: set YieldIntent, call switch_to_scheduler()
  • x86-64 naked asm: push rbx, rbp, r12–r15; save actor rsp → ACTOR_SP thread-local; rsp swap
  • Scheduler resumes: pop rbx, rbp, r12–r15; ret → back in schedule_loop()
  • Post-yield handling: PREEMPTION_ENABLED = false; check is_actor_done(); read YieldIntent
  • Lock shared state: save actor.sp; if Yield: push to run_queue; if Park: state = Parked; if Done: finalize_actor
  • Pop next actor: drain timers + IO first, then run_queue.pop_front()
  • Resume actor: set thread-locals → switch_to_actor(); rsp swap
  • Actor resumes exactly where it yielded

The 6 Yield Sources

| Source | Intent set | Who re-queues | Notes |
|---|---|---|---|
| yield_now() | Yield | Scheduler, immediately | Actor stays Runnable; pushed back to queue tail |
| Allocator preemption | Yield | Scheduler, immediately | RDTSC check in maybe_preempt() triggers switch_to_scheduler() |
| channel::recv() (empty) | Park | channel::send() → unpark() | Receiver PID stored in channel's parked_receiver |
| mutex::lock() (contended) | Park | MutexGuard::drop() or timer timeout | FIFO waiter queue; timeout via WaitTimeout timer entry |
| sleep(d) | Park | Timer heap → schedule_loop drain | Inserts Reason::Sleep entry; scheduler unparks on pop |
| wait_readable/writable(fd) | Park | epoll thread → completion queue → scheduler | EPOLLONESHOT; one ADD → one wakeup → one DEL per call |
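The pop → resume → handle-intent loop can be modelled without any assembly. In this sketch each "actor" is a closure that returns its next YieldIntent when called, which stands in for "run until the next yield"; real smarm actors are green threads, not closures, and the `Model` type and its methods are hypothetical.

```rust
use std::collections::VecDeque;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum YieldIntent {
    Yield,
    Park,
    Done,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum State {
    Runnable,
    Parked,
    Dead,
}

/// Stand-in for a green thread: one call models one resume.
pub type Step = Box<dyn FnMut() -> YieldIntent>;

pub struct Model {
    pub actors: Vec<(State, Step)>,
    pub run_queue: VecDeque<usize>,
}

impl Model {
    pub fn new() -> Self {
        Model { actors: Vec::new(), run_queue: VecDeque::new() }
    }

    pub fn spawn(&mut self, step: Step) -> usize {
        let id = self.actors.len();
        self.actors.push((State::Runnable, step));
        self.run_queue.push_back(id);
        id
    }

    /// Re-queue a parked actor; any other state is left alone in this model.
    pub fn unpark(&mut self, id: usize) {
        if self.actors[id].0 == State::Parked {
            self.actors[id].0 = State::Runnable;
            self.run_queue.push_back(id);
        }
    }

    /// Run until the queue is empty; returns the order in which actors were resumed.
    pub fn run(&mut self) -> Vec<usize> {
        let mut trace = Vec::new();
        while let Some(id) = self.run_queue.pop_front() {
            trace.push(id);
            let step = &mut self.actors[id].1;
            let intent = step(); // "switch_to_actor" in the real runtime
            match intent {
                YieldIntent::Yield => self.run_queue.push_back(id),
                YieldIntent::Park => self.actors[id].0 = State::Parked,
                YieldIntent::Done => self.actors[id].0 = State::Dead,
            }
        }
        trace
    }
}
```

An actor that returns Yield goes to the tail of the queue, one that returns Park disappears from it until something calls unpark, matching the "Who re-queues" column.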

New Actor From First Resume

Spawning is the trickiest part of the runtime. An actor's first resume is fundamentally different from subsequent ones because we can't "call" into a new stack — we have to ret into it.

1

scheduler::spawn(f) called

Allocates a slot from free list or grows the slots vec. Assigns Pid(index, generation). Creates a Stack (64 KB mmap + guard page).

2

Initial stack frame written (context::init_actor_stack())

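Starting from (top & !15) - 8, that is, the stack top rounded down to a 16-byte boundary and then lowered by 8, it pushes downward: the trampoline function pointer as the ret address, then 6 zero words for the callee-saved registers. The extra 8 bytes leave rsp at 8 modulo 16 once ret has jumped into the trampoline, which is the alignment the SysV AMD64 ABI expects at function entry. The resulting rsp is stored as actor.sp. No actual function call has happened yet.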

high addr ← top
  top-8:  &trampoline   ← will be popped by 'ret'
  top-16: 0             ← rbx
  top-24: 0             ← rbp
  top-32: 0             ← r12
  top-40: 0             ← r13
  top-48: 0             ← r14
  top-56: 0             ← r15  ← initial rsp stored here
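
The address arithmetic behind this layout can be checked in isolation. The sketch below only computes addresses and never touches memory; `plan_initial_frame` and `InitialFrame` are hypothetical names, and `raw_top` is assumed to be the highest address of the mmap'd region.

```rust
/// Callee-saved GPRs popped by the switch code: rbx, rbp, r12, r13, r14, r15.
const SAVED_GPRS: usize = 6;
const WORD: usize = 8;

#[derive(Debug, PartialEq, Eq)]
pub struct InitialFrame {
    /// Value stored as actor.sp: points at the lowest saved-register slot.
    pub rsp: usize,
    /// Address of the word that holds &trampoline.
    pub ret_slot: usize,
    /// rsp after the six pops and the `ret` into the trampoline.
    pub entry_rsp: usize,
}

pub fn plan_initial_frame(raw_top: usize) -> InitialFrame {
    // Round down to 16 bytes, then step down 8 so the frame starts at 8 mod 16.
    let top = (raw_top & !15) - WORD;
    let ret_slot = top - WORD;
    let rsp = ret_slot - SAVED_GPRS * WORD;
    InitialFrame { rsp, ret_slot, entry_rsp: top }
}
```

Popping six words and then executing ret consumes 56 bytes, so the trampoline starts executing with rsp equal to `entry_rsp`.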
3

Closure stored separately

The closure Box<dyn FnOnce() + Send> goes into SharedState::pending_closures keyed by PID — not on the actor's stack. This is because we can't pass it via a register during first resume. The PID is pushed to the run queue; slot state is Runnable.

4

Scheduler picks up the PID, prepares first resume

Before calling switch_to_actor(), the scheduler pops the closure from pending_closures and writes it to the CURRENT_ACTOR_BOX thread-local. Then sets ACTOR_SP, sets CURRENT_PID, arms the timeslice, enables preemption.

5

First context switch lands in trampoline()

switch_to_actor() saves the scheduler's GPRs, loads actor.sp as the new rsp, pops the 6 zero words (restoring the "saved" registers to zero), then rets — which pops the trampoline address from the stack and jumps to it. We're now executing on the actor's stack.

6

trampoline() reads the closure and runs it

Takes the closure from CURRENT_ACTOR_BOX thread-local (consuming it — subsequent resumes skip this). Calls it inside panic::catch_unwind(AssertUnwindSafe(f)). The actor's code runs normally from here. Any yield (channel, mutex, preemption) calls switch_to_scheduler(); the scheduler saves actor state, processes intent, loops.

7

Actor returns → trampoline handles completion

If catch_unwind returns Ok(()), outcome is Exit. If it returns Err(payload), outcome is Panic(payload). Either way, outcome is written to LAST_OUTCOME thread-local, ACTOR_DONE is set to true, then switch_to_scheduler() is called for the last time. Scheduler sees is_actor_done() == true, calls finalize_actor(): delivers Signal to supervisor, unparks joiners, reclaims slot.
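The mapping from catch_unwind's result to an outcome can be sketched in a few lines. `run_body` is a hypothetical helper; the real trampoline additionally writes thread-locals and switches back to the scheduler, which is omitted here.

```rust
use std::any::Any;
use std::panic::{self, AssertUnwindSafe};

pub enum Outcome {
    Exit,
    Panic(Box<dyn Any + Send>),
}

/// Run an actor body and classify how it ended.
/// Requires panic = "unwind"; with panic = "abort" the process dies instead.
pub fn run_body(f: Box<dyn FnOnce() + Send>) -> Outcome {
    match panic::catch_unwind(AssertUnwindSafe(f)) {
        Ok(()) => Outcome::Exit,
        Err(payload) => Outcome::Panic(payload),
    }
}
```

AssertUnwindSafe is needed because an arbitrary boxed closure is not known to be UnwindSafe; the wrapper asserts that observing broken invariants after a panic is acceptable, which fits here since the actor's state is discarded once it dies.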

Allocator-Driven Timeslicing

How it works

The PreemptingAllocator is installed as the process's #[global_allocator]. Its alloc(), alloc_zeroed(), and realloc() all call maybe_preempt() before delegating to the system allocator.

maybe_preempt() decrements a thread-local counter. Every 128 allocations (default), it reads RDTSC. If rdtsc() - timeslice_start > 300_000 cycles (~100µs at 3 GHz) and PREEMPTION_ENABLED == true, it calls switch_to_scheduler().

The check!() macro calls the same maybe_preempt() function — for tight loops that make no allocations.

Invariant: preemption must be off when holding smarm locks

If preemption fired while the scheduler held SharedState, the context switch would try to re-acquire the same mutex → deadlock. smarm prevents this with:

  • PREEMPTION_ENABLED = false in the scheduler loop before/after switch_to_actor()
  • with_shared() saves and disables preemption while the mutex is held
  • NoPreempt RAII guard used in channel/mutex slow paths
  • trace::record() also disables preemption (it can allocate)
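
A save-and-restore guard of this kind might look like the sketch below. It reuses the names PREEMPTION_ENABLED and NoPreempt from the text, but the bodies are assumptions, and `preemption_enabled` and `with_no_preempt` are hypothetical helpers.

```rust
use std::cell::Cell;

thread_local! {
    // Assumed default for the sketch: enabled.
    static PREEMPTION_ENABLED: Cell<bool> = const { Cell::new(true) };
}

pub fn preemption_enabled() -> bool {
    PREEMPTION_ENABLED.with(|e| e.get())
}

/// RAII guard: disables preemption on creation, restores the previous
/// value (not unconditionally `true`) on drop, so guards nest correctly.
pub struct NoPreempt {
    previous: bool,
}

impl NoPreempt {
    pub fn new() -> Self {
        let previous = PREEMPTION_ENABLED.with(|e| e.replace(false));
        NoPreempt { previous }
    }
}

impl Drop for NoPreempt {
    fn drop(&mut self) {
        PREEMPTION_ENABLED.with(|e| e.set(self.previous));
    }
}

/// Run `f` with preemption disabled, e.g. around a lock-holding section.
pub fn with_no_preempt<R>(f: impl FnOnce() -> R) -> R {
    let _guard = NoPreempt::new();
    f()
}
```

Restoring the saved value rather than setting `true` matters when an inner guard is dropped while an outer one is still alive.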

Known gap: tight no-alloc loops never reach the preemption check unless they call check!() explicitly. This is documented and by design: such loops are uncommon in message-passing workloads.

// preempt.rs — simplified
pub fn maybe_preempt() {
    ALLOC_COUNT.with(|c| {
        let n = c.get();
        if n == 0 {
            c.set(ACTIVE_ALLOC_INTERVAL.with(|i| i.get()));  // reset counter
            if PREEMPTION_ENABLED.with(|e| e.get()) {
                let elapsed = rdtsc() - TIMESLICE_START.with(|s| s.get());
                if elapsed > ACTIVE_TIMESLICE_CYCLES.with(|i| i.get()) {
                    unsafe { switch_to_scheduler() };  // YieldIntent::Yield
                }
            }
        } else {
            c.set(n - 1);
        }
    });
}

Two Background Threads, One Wake Pipe

[Diagram: IO data flow]

  • Actor calls wait_readable(fd) or block_on_io(f) → park_current() → state = Parked. The fd path issues epoll_ctl ADD; the blocking path issues submit(closure).
  • epoll thread: epoll_wait(-1) loop, EPOLLONESHOT per fd. On ready: push FdReady, write wake_pipe. On shutdown pipe: exit.
  • pool thread: mpsc::recv() loop. Runs catch_unwind(closure), pushes the Blocking result, writes wake_pipe. Sender (tx) drop → exit.
  • completions: Arc<Mutex<VecDeque>> holding FdReady { fd, events } and Blocking { pid, result }; drained by schedule_loop.
  • scheduler: poll(wake_fd), drain the wake pipe, drain completions. FdReady → lookup waiters[fd], unpark(pid). Blocking → store in slot, unpark.

epoll_ctl ADD/DEL is called by the scheduler thread directly on the epollfd — this is legal per the epoll_ctl(2) man page even while the epoll thread is inside epoll_wait. Avoids needing a second command channel.
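The pool-thread half of this design can be sketched with the standard library alone. Several things here are assumptions: an `mpsc::Sender<()>` stands in for the wake pipe, job results are fixed to `i32`, `pid` is a plain `u32`, and the `Pool` type is hypothetical. The epoll half is not modelled.

```rust
use std::collections::VecDeque;
use std::panic::{self, AssertUnwindSafe};
use std::sync::mpsc::{self, Sender};
use std::sync::{Arc, Mutex};
use std::thread::{self, JoinHandle};

pub type Job = Box<dyn FnOnce() -> i32 + Send>;

#[derive(Debug, PartialEq, Eq)]
pub enum Completion {
    Blocking { pid: u32, result: Result<i32, String> },
}

pub type Completions = Arc<Mutex<VecDeque<Completion>>>;

pub struct Pool {
    jobs: Option<Sender<(u32, Job)>>,
    worker: Option<JoinHandle<()>>,
}

impl Pool {
    pub fn start(completions: Completions, wake: Sender<()>) -> Pool {
        let (tx, rx) = mpsc::channel::<(u32, Job)>();
        let worker = thread::spawn(move || {
            // The loop ends when the job sender is dropped.
            for (pid, job) in rx {
                // A panicking job becomes an Err completion instead of
                // killing the pool thread.
                let result = panic::catch_unwind(AssertUnwindSafe(job))
                    .map_err(|_| "job panicked".to_string());
                completions
                    .lock()
                    .unwrap()
                    .push_back(Completion::Blocking { pid, result });
                let _ = wake.send(()); // "write one byte to the wake pipe"
            }
        });
        Pool { jobs: Some(tx), worker: Some(worker) }
    }

    pub fn submit(&self, pid: u32, job: Job) {
        let jobs = self.jobs.as_ref().expect("pool is running");
        jobs.send((pid, job)).expect("worker is alive");
    }
}

impl Drop for Pool {
    fn drop(&mut self) {
        drop(self.jobs.take()); // closing the channel stops the worker
        if let Some(worker) = self.worker.take() {
            let _ = worker.join();
        }
    }
}
```

The scheduler side would block on the wake signal, then lock the queue and drain every completion in one pass.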

Things That Would Bite You

Lost-wakeup window

Between registering as a channel's parked_receiver and calling park_current(), a sender could call unpark(). At that moment the actor is still Runnable, so unpark() sets pending_unpark = true instead of re-queuing. The scheduler checks this flag after the Park yield and re-queues immediately rather than parking. This flag also protects epoll and mutex paths.
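The pending_unpark handshake can be modelled as two small state transitions. This is a sketch with hypothetical types (`Sched`, `Slot`) and a `usize` standing in for Pid; the real logic lives behind SharedState's mutex.

```rust
use std::collections::VecDeque;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum State {
    Runnable,
    Parked,
}

pub struct Slot {
    pub state: State,
    pub pending_unpark: bool,
}

pub struct Sched {
    pub slots: Vec<Slot>,
    pub run_queue: VecDeque<usize>,
}

impl Sched {
    /// `n` actors, all currently running or about to run (not in the queue).
    pub fn new(n: usize) -> Self {
        let slots = (0..n)
            .map(|_| Slot { state: State::Runnable, pending_unpark: false })
            .collect();
        Sched { slots, run_queue: VecDeque::new() }
    }

    /// Called by a sender or unlocker.
    pub fn unpark(&mut self, pid: usize) {
        let slot = &mut self.slots[pid];
        match slot.state {
            State::Parked => {
                slot.state = State::Runnable;
                self.run_queue.push_back(pid);
            }
            // Target has not parked yet: remember the wakeup instead of losing it.
            State::Runnable => slot.pending_unpark = true,
        }
    }

    /// Called by the scheduler after the actor yields with intent Park.
    pub fn on_park_yield(&mut self, pid: usize) {
        let slot = &mut self.slots[pid];
        if slot.pending_unpark {
            slot.pending_unpark = false;
            self.run_queue.push_back(pid); // wakeup already arrived: stay Runnable
        } else {
            slot.state = State::Parked;
        }
    }
}
```

Whichever order unpark and the Park yield arrive in, the actor ends up back on the run queue exactly once.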

std::thread::sleep inside actor

Blocks the entire OS scheduler thread: no other actor can run on that thread until the sleep returns, and with a single scheduler thread that means every actor stalls. There's no detection. Use smarm::sleep(d) instead.

Allocations while holding SharedState

The with_shared() helper disables preemption while the mutex is held. But any code path that allocates inside with_shared and then tries to acquire SharedState again will deadlock. All internal smarm code is carefully structured to avoid this.

Global run queue mutex

All N scheduler threads contend on a single Mutex<SharedState>. This is the primary scalability ceiling — visible in the benchmark suite as "tokio-favored" scenarios. Identified, documented, deferred. The fix would be per-thread deques with work stealing.

No timer cancellation

When a mutex lock is granted before its timeout, the timer entry stays in the heap. It fires eventually, the callback sees "actor is no longer waiting" and no-ops. Cost is ~32 bytes and a few cycles per stale entry. Bounded by one entry per parked actor.

fd leak on actor death during IO wait

If an actor dies while waiting on an fd, the epoll registration is leaked. EPOLLONESHOT bounds damage to one stale wakeup, which the scheduler drops when it can't find the PID in waiters. Noted in io.rs as a known gap for a future pass.

XMM registers not saved in context switch

This is intentional and correct. XMM0–15 are caller-saved in SysV AMD64 ABI. Every yield passes through a Rust call site, so the compiler has already spilled live XMM values to the actor's stack before we get to the naked asm. They're restored when the actor resumes because they're on its own stack.

panic = unwind is required

The trampoline uses catch_unwind to intercept actor panics before they reach the naked assembly shim. If a user sets panic = abort, panics kill the process instead of being caught — the supervision tree collapses to process death. This is documented and the profile is set in Cargo.toml.